28 research outputs found

    SEED: efficient clustering of next-generation sequences.

    Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When SEED was used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
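The clustering idea in the abstract above (index reads in hash tables, pick a center, collect every neighbor within the mismatch limit) can be illustrated with a minimal Python sketch. It is not SEED's implementation. It swaps in a simpler pigeonhole scheme for the block spaced seeds: each read is cut into `max_mismatches + 1` contiguous blocks, so two equal-length reads with at most that many mismatches must share at least one identical block. Overhanging residues are ignored, the first unassigned read stands in for a virtual center, and the names `block_keys` and `cluster_reads` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_keys(seq, n_blocks):
    """Cut seq into n_blocks contiguous blocks, each tagged with its block index."""
    size = len(seq) // n_blocks
    keys = []
    for i in range(n_blocks):
        start = i * size
        # The last block absorbs any remainder.
        end = len(seq) if i == n_blocks - 1 else start + size
        keys.append((i, seq[start:end]))
    return keys


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering: assumption-laden sketch, not the published algorithm."""
    n_blocks = max_mismatches + 1

    # Hash table: (block index, block sequence) -> ids of reads containing it.
    index = defaultdict(list)
    for rid, seq in enumerate(reads):
        for key in block_keys(seq, n_blocks):
            index[key].append(rid)

    assigned = [False] * len(reads)
    clusters = []
    for rid, seq in enumerate(reads):
        if assigned[rid]:
            continue
        # The first unassigned read acts as the cluster center.
        assigned[rid] = True
        members = [rid]
        candidates = set()
        for key in block_keys(seq, n_blocks):
            candidates.update(index[key])
        # Verify each candidate with an explicit mismatch count.
        for cid in sorted(candidates):
            other = reads[cid]
            if (not assigned[cid] and len(other) == len(seq)
                    and hamming(seq, other) <= max_mismatches):
                assigned[cid] = True
                members.append(cid)
        clusters.append(members)
    return clusters


reads = [
    "ACGTACGTACGT",
    "ACGTACGTACGA",
    "ACGTTCGTACGT",
    "TTTTTTTTTTTT",
    "TTTTTTTTTTTA",
    "ACGAACGAACGA",
]
print(cluster_reads(reads))
```

Only reads that share a block with the center are ever compared, which is what keeps the candidate set small. The result depends on input order, because the first unassigned read always becomes the center.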

    Sequence Analysis of the Potato Aphid Macrosiphum euphorbiae Transcriptome Identified Two New Viruses

    The potato aphid, Macrosiphum euphorbiae, is an important agricultural pest that causes economic losses to potato and tomato production. To establish the transcriptome for this aphid, RNA-Seq libraries constructed from aphids maintained on tomato plants were used in Illumina sequencing, generating 52.6 million 75-105 bp paired-end reads. The reads were assembled using Velvet/Oases software with SEED preprocessing, resulting in 22,137 contigs with an N50 value of 2,003 bp. After removal of contigs of tomato host origin, 20,254 contigs were annotated using BLASTx searches against the non-redundant protein database from the National Center for Biotechnology Information (NCBI) as well as InterProScan. This identified matches for 74% of the potato aphid contigs. The highest ranking hits for over 12,700 contigs were against the related pea aphid, Acyrthosiphon pisum. Gene Ontology (GO) was used to classify the identified M. euphorbiae contigs into biological process, cellular component and molecular function. Among the contigs, sequences of microbial origin were identified. Sixty-five contigs originated from the obligate bacterial endosymbiont of the aphid, Buchnera aphidicola, and two contigs had amino acid similarities to viruses. The latter two were named Macrosiphum euphorbiae virus 2 (MeV-2) and Macrosiphum euphorbiae virus 3 (MeV-3). MeV-2 showed the highest sequence identity to the Dysaphis plantaginea densovirus, and MeV-3 to the Hubei sobemo-like virus 49. Characterization of MeV-2 and MeV-3 indicated that both are transmitted vertically from adult aphids to nymphs. MeV-2 peptides were detected in the aphid saliva, and only MeV-2 nucleic acids, not those of MeV-3, were detected inside tomato leaves exposed to virus-infected aphids. However, MeV-2 nucleic acids did not persist in tomato leaf tissues after the aphids were cleared from the plants, indicating that MeV-2 is likely an aphid virus.
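Several abstracts in this listing report N50 values. As a reminder of what the metric measures, here is a small Python sketch using the common definition: the largest contig length L such that contigs of length L or longer contain at least half of the total assembled bases. The function name `n50` and the example lengths are illustrative and do not come from any of the listed papers.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (common definition)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in bp, for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

Because the metric is weighted by length, a few long contigs can raise N50 even when the contig count barely changes.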

    HALC: High throughput algorithm for long read error correction

    Background: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis. Results: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region’s repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads’ alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms. Conclusions: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc

    Improved hybrid method for image super‐resolution

    Improving image resolution has broad applications and is an important research topic. Recently, a hybrid method, Adaptive Sparse Domain Selection (ASDS), which combines a reconstruction‐based method and an example‐based method, has been proposed to take advantage of the two, but it may not reconstruct sufficient details. In this study, the authors propose to improve ASDS: Zeyde's method is first used to obtain an intermediate image with high‐frequency details, and the obtained image is then used to replace the autoregressive model of ASDS as the example‐based term. In addition, the authors may split the input image into patches and use different parameter settings for patches with different amounts of detail. Experimental results demonstrate that the improved hybrid methods can produce high‐quality images, both quantitatively and perceptually.

    AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references.

    Motivation: De novo assemblies of genomes remain one of the most challenging applications in next-generation sequencing. Usually, their results are incomplete and fragmented into hundreds of contigs. Repeats in genomes and sequencing errors are the main reasons for these complications. With the rapidly growing number of sequenced genomes, it is now feasible to improve assemblies by guiding them with genomes from related species. Results: Here we introduce AlignGraph, an algorithm for extending and joining de novo-assembled contigs or scaffolds guided by closely related reference genomes. It aligns paired-end (PE) reads and preassembled contigs or scaffolds to a close reference. From the obtained alignments, it builds a novel data structure, called the PE multipositional de Bruijn graph. The incorporated positional information from the alignments and PE reads allows us to extend the initial assemblies, while avoiding incorrect extensions and early terminations. In our performance tests, AlignGraph was able to substantially improve the contigs and scaffolds from several assemblers. For instance, 28.7-62.3% of the contigs of Arabidopsis thaliana and human could be extended, resulting in improvements of common assembly metrics, such as an increase of the N50 of the extendable contigs by 89.9-94.5% and 80.3-165.8%, respectively. In another test, AlignGraph was able to improve the assembly of a published genome (Arabidopsis strain Landsberg) by increasing the N50 of its extendable scaffolds by 86.6%. These results demonstrate AlignGraph's efficiency in improving genome assemblies by taking advantage of closely related references. Availability and implementation: The AlignGraph software can be downloaded for free from this site: https://github.com/baoe/AlignGraph
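To give a feel for why positional information helps a de Bruijn graph, the following Python sketch builds a much simplified positional variant. It is an assumption-heavy illustration, not the PE multipositional de Bruijn graph of AlignGraph: each node is a pair of a k-mer and the reference position where the alignment places it, paired-end information is ignored, and reads are assumed to align without gaps. The function name `positional_de_bruijn` and the toy reads are hypothetical.

```python
from collections import defaultdict


def positional_de_bruijn(aligned_reads, k):
    """Build edges between consecutive (k-mer, reference position) nodes.

    aligned_reads: iterable of (start, sequence) pairs, where start is the
    assumed gap-free alignment position of the read on the reference.
    """
    graph = defaultdict(set)
    for start, seq in aligned_reads:
        for i in range(len(seq) - k):
            src = (seq[i:i + k], start + i)
            dst = (seq[i + 1:i + 1 + k], start + i + 1)
            graph[src].add(dst)
    return dict(graph)


# Two toy reads covering a region where the 2-mers "AC" and "CG" repeat.
toy_reads = [(0, "ACGAC"), (2, "GACGT")]
graph = positional_de_bruijn(toy_reads, k=2)
for node in sorted(graph, key=lambda n: n[1]):
    print(node, "->", sorted(graph[node]))
```

In a plain de Bruijn graph the repeated k-mers "AC" and "CG" would each collapse into a single node, creating a cycle and a branch. Keying nodes by position keeps the two copies apart, so the path through this toy region stays linear.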

    Additional file 1 of HALC: High throughput algorithm for long read error correction

    Figure S1. Percentage of genome above various long read coverages on the A. thaliana data. Figure S2. Percentage of genome above various long read coverages on the Maylandia zebra data. Table S1. Data sets used in the evaluation. Table S2. Running time and memory usage in the evaluation of error correction performance. (PDF 85.9 kb)